Abstract
RNAseq analysis has become a common practice in most molecular biology laboratories in the world. Several tools have been developed to achieve differential expression analysis, functional enrichment, gene co-expression or data visualization among others, from RNAseq data. However, many of them require independent analysis and, in many ocasions, relatively high programming skills. The package RNAseqEasy provides an integration of some of this tools into a single user-friendly workflow, and tries to simplify it to require only basic notions in R programming. It is mainly based on DESeq21, topGO2 and WGCNA3 packages to carry out PCA exploratory plots, differential expression analysis, gene ontology enrichment analysis and gene-coexpression from Salmon4 quantified data. This vignette explains the use of the package and demonstrates typical workflows.
RNAseqEasy package version: 0.0.0.9000
The data used as example in this vignette was generated in the work of Liang et al.5 in the liverwort Marchantia polymorpha.
The RNAseqEasy package is thought to be used from Salmon mapped and
quantified data. The first required input is the path [1] of the
directory where quant.sf samples are saved will be the
input. The other input necessary for the analysis is a data frame [2]
that includes the name of the samples and the information describing
them (i.e. variables or conditions defining the experiment).
Salmon4 is a widely used
software for quantifying transcript abundance. A well detailed tutorial
can be found in their official
webpage. It is a very fast tool for transcript quantification
directly from fastq.gzfiles. The only requirement is the
creation of an index of the transcriptome of the
organism we are working with (do not use the genome!!!), also explained
in their tutorial.
As a result, we will obtain a folder [1] containing one folder per
analyzed sample with its names. In each folder, the main output fail
will be named quant.sf, which contains the name of each
transcript and their abundance in Transcripts Per Million (TPM), among
other information (length, effective length and number of reads). The
path to this output folder will be input for our analysis [1].
# URL from Zenodo
data_url <- "https://zenodo.org/records/15800134/files/GSE275561.zip?download=1"
# Temporal directory to download data
output_dir <- file.path(tempdir(), "GSE275561")
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)
zip_file <- file.path(tempdir(), "GSE275561.zip")
# Download zip file with Salmon quantified data from Zenodo
download.file(url = data_url, destfile = zip_file, mode = "wb")
# Unzip compressed file
unzip(zip_file, exdir = output_dir)
list.files(output_dir)## [1] "__MACOSX" "02_Salmon" "Sample_Data_Wendy.txt"
So, we set the “02_Salmon” folder as sampleDir, which
will direct the analysis to the directory to retrieve quantified data
for each sample.
For each experiment analysis, it is required to generate a data frame
[2] (we call it sample_table) which correlates the name of
each sequenced sample with the variables describing it in the
experiment. So, it will include a Sample column including
sample names, and an extra column for each variable of factor that were
included in the experiment design. In our example, we include 12
different samples (Wendy1 to Wendy12), and there were 3 different
variables describing them:
Genotype. Samples were divided in two different
genotypes: ‘Tak1’ (wild-type organisms) and ‘gh3a’ (CRISPR-Cas9
mutants).Treatment. Two different treatments were applied to
samples: ‘Mock’ or ‘OPDA’ (exposure to a high concentration of a plant
stress hormone from jasmonic acid family).Replicate. Each category of samples included 3
independent biological replicates.# Data frame with 'Sample', 'Genotype', 'Treatment and 'Replicate' information
sample_table <- data.frame(
Sample = paste0("Wendy", seq(1,12)),
Genotype = rep(c("Tak1", "gh3a"), each = 6),
Treatment = rep(c("Mock", "OPDA"), each = 3),
Replicate = seq(1,3)
)
sample_table## Sample Genotype Treatment Replicate
## 1 Wendy1 Tak1 Mock 1
## 2 Wendy2 Tak1 Mock 2
## 3 Wendy3 Tak1 Mock 3
## 4 Wendy4 Tak1 OPDA 1
## 5 Wendy5 Tak1 OPDA 2
## 6 Wendy6 Tak1 OPDA 3
## 7 Wendy7 gh3a Mock 1
## 8 Wendy8 gh3a Mock 2
## 9 Wendy9 gh3a Mock 3
## 10 Wendy10 gh3a OPDA 1
## 11 Wendy11 gh3a OPDA 2
## 12 Wendy12 gh3a OPDA 3
In differential expression analysis, comparisons are performed against a reference level. By default, R set levels based on alphabetical order for each variable. We can set the specific comparison we want to perform in each analysis, but it is a good practice to set levels order as we want to before we continue. We transform each variable column into factor and establish the level order. In this example, “TaK-1” is the reference genotype, and “Mock” is the reference treatment.
# Convert 'Genotype' into factor, setting "Tak1" as reference
sample_table$Genotype <- factor(sample_table$Genotype,
levels = c("Tak1", "gh3a"))
# Convert 'Treatment' into factor, setting "Mock" as reference
sample_table$Treatment <- factor(sample_table$Treatment, levels = c("Mock", "OPDA"))
str(sample_table)## 'data.frame': 12 obs. of 4 variables:
## $ Sample : chr "Wendy1" "Wendy2" "Wendy3" "Wendy4" ...
## $ Genotype : Factor w/ 2 levels "Tak1","gh3a": 1 1 1 1 1 1 2 2 2 2 ...
## $ Treatment: Factor w/ 2 levels "Mock","OPDA": 1 1 1 2 2 2 1 1 1 2 ...
## $ Replicate: int 1 2 3 1 2 3 1 2 3 1 ...
Everything else required for the analysis is not specific of the experiment you are performing, but depends on the organism you are working with. There are two different files that we will need to import for the analysis:
In this example, we are working with samples extracted from the
liverwort Marchantia polymorpha, a model plant widely use to
study how ancient or conserved biological processes are in plant
evolution. To get the reference transcriptome, we download it from marchantia.info, the Genome Database
for Marchantia polymorpha. The get.fasta.name()
function from phylotools6
package allows us to obtain the transcript names from the fasta file. In
Marchantia, gene names and transcript share the same
nomenclature, only adding a dot and a number to the different
transcripts of a gene. To correlate them, we create a data frame
including a column called Name containing transcript names,
and based on that we create a new column called GENEID
removing last 2 characters (using str_sub() function from
string package).
library(phylotools)
library(tidyverse)
URL <- gzcon(url(paste("https://marchantia.info/download/MpTak_v6.1/", "MpTak_v6.1r1.mrna.fasta.gz", sep="")))
txt <- readLines(URL)
Marchantia7_Transcripts <- get.fasta.name(textConnection(txt), clean_name = FALSE)
head(Marchantia7_Transcripts)## [1] "Mp1g00070.1" "Mp1g00080.1" "Mp1g00090.1" "Mp1g00090.2" "Mp1g00090.3"
## [6] "Mp1g00090.4"
Marchantia7_tx2gene <- data.frame(Name = Marchantia7_Transcripts) %>%
tidyr::separate(Name, sep = " ", into = c("TXNAME", "CDS")) %>%
dplyr::mutate(GENEID = stringr::str_sub(TXNAME, 1, -3)) %>%
dplyr::select(-CDS)
head(Marchantia7_tx2gene, 10)## TXNAME GENEID
## 1 Mp1g00070.1 Mp1g00070
## 2 Mp1g00080.1 Mp1g00080
## 3 Mp1g00090.1 Mp1g00090
## 4 Mp1g00090.2 Mp1g00090
## 5 Mp1g00090.3 Mp1g00090
## 6 Mp1g00090.4 Mp1g00090
## 7 Mp1g00090.5 Mp1g00090
## 8 Mp1g00090.6 Mp1g00090
## 9 Mp1g00100.1 Mp1g00100
## 10 Mp1g00110.1 Mp1g00110
For this example, functional annotation of Gene Ontologies (GOs) for Marchantia genes can be downloaded from the package in a csv file. Many organisms reference databases include this kind of files. If you cannot find them, there are tools and tutorial to annotate your reference genome.
library(RNAseqEasy)
GO_data <- system.file("extdata",
"Mpo6.1_GO_db_GeneGO.csv",
package="RNAseqEasy", mustWork=TRUE)
Mpo_GO_GOSLIM <- read.csv(GO_data, sep = "\t")
head(Mpo_GO_GOSLIM)## Transcript GO
## 1 Mp1g00060 GO:0003676
## 2 Mp1g00060 GO:0032774
## 3 Mp1g00060 GO:0034062
## 4 Mp1g00060 GO:0042644
## 5 Mp1g00060 GO:0046914
## 6 Mp1g00070 GO:0000075
This annotation file is a two column data frame correlating
transcript names and GO terms (one assotiation per row). For topGO GO
enrichement analyses, we will need to convert it into a list containing
transcript as keys and all their annotated GO terms as values. The
load_topGO_db() function included in this package helps us
in this purpose.
## $Mp1g00060
## [1] "GO:0003676" "GO:0032774" "GO:0034062" "GO:0042644" "GO:0046914"
##
## $Mp1g00070
## [1] "GO:0000075" "GO:0000123" "GO:0000725" "GO:0000790" "GO:0000800"
## [6] "GO:0000976" "GO:0001894" "GO:0003682" "GO:0003700" "GO:0004842"
## [11] "GO:0005704" "GO:0005759" "GO:0005813" "GO:0005829" "GO:0005886"
## [16] "GO:0006302" "GO:0007623" "GO:0008023" "GO:0008157" "GO:0008270"
## [21] "GO:0009653" "GO:0010074" "GO:0010332" "GO:0016458" "GO:0016607"
## [26] "GO:0031057" "GO:0031062" "GO:0031436" "GO:0032409" "GO:0032786"
## [31] "GO:0034243" "GO:0035065" "GO:0036464" "GO:0040029" "GO:0042127"
## [36] "GO:0042800" "GO:0042803" "GO:0042981" "GO:0043970" "GO:0044030"
## [41] "GO:0044093" "GO:0044648" "GO:0044666" "GO:0045322" "GO:0045944"
## [46] "GO:0048872" "GO:0051050" "GO:0051053" "GO:0051569" "GO:0065003"
## [51] "GO:0070531" "GO:0070577" "GO:0071339" "GO:0071479" "GO:0080182"
## [56] "GO:0099402" "GO:1902750" "GO:1903507" "GO:1990904" "GO:2000113"
## [61] "GO:2001025" "GO:2001038"
##
## $Mp1g00080
## [1] "GO:0005794" "GO:0009570" "GO:0009941"
##
## $Mp1g00090
## [1] "GO:0005773" "GO:0005829" "GO:0012505" "GO:0031410" "GO:0034272"
## [6] "GO:0098588" "GO:0098805"
##
## $Mp1g00100
## [1] "GO:0000137" "GO:0000138" "GO:0005634" "GO:0005768" "GO:0005797"
## [6] "GO:0005802" "GO:0007155" "GO:0008194" "GO:0009507" "GO:0009832"
## [11] "GO:0010214" "GO:0010395" "GO:0031224" "GO:0045489" "GO:0070592"
## [16] "GO:0080157"
##
## $Mp1g00110
## [1] "GO:0000491" "GO:0001094" "GO:0005634" "GO:0005737" "GO:0042802"
## [6] "GO:0051117" "GO:0070062" "GO:0070761" "GO:0097159" "GO:1901363"
To perform GO semantic clustering, the analysis is based on rrvgo7 and GOSemSim8 packages. The available species for this analysis are included in Bioconductor webpage. The only plant organism included is Arabidopsis thaliana, so we will use its code (org.At.tair.db) for this analysis. We can save this data in an object, so it will save computing time in the following analysis. There are three classes of GO terms: Biological Processes (BP), Molecular Functions (MF) and Cell Components (CC). In this example, we will only be focused on biological processes.
At this point, we already have everything we need to perform our
RNA-seq analysis: a directory with our quantified samples
(sampleDir), a data frame with the experiment metadata
(sample_table), a data frame correlating gene names and
transcript names for our organism (Marchantia7_tx2gene), an
annotation list correlating transcript names and their annotated GO
terms (geneID2GO) and the model organism reference data for
the semantic clustering (save).
We want to obtain differential expressed genes (DEGs) in Tak-1 genotype in OPDA treatment compared to Tak-1 genotype in Mock conditions. To organize our results, we will create a folder to save all differential expression analyses and we will call it “DESeq2”, and inside we will create another folder to save specifically this comparison, and we will call it “Tak1_OPDA”.
dir.create(file.path(output_dir, "DESeq2"))
dir.create(file.path(output_dir, "DESeq2", "Tak1_OPDA"))
output_Dir_DESeq2 <- file.path(output_dir, "DESeq2", "Tak1_OPDA")To run the analysis, we should install DESeq2 and topGO packages from Bioconductor and load them in our R session.
Now, we are ready to run our differential expression analysis. For
that, we will use DESeq2_simple function. We set all
required parameters that we previously defined
(output_path, sampleDir,
sample_table, tx2gene, geneID2GO,
orgdb and semdata). But, still, there are some
parameters that we have to define for this analysis:
For more information about contrasts, read the DESeq2 vignette and check this tutorial.
library(DESeq2)
library(topGO)
DESeq2_simple(
output_path = output_Dir_DESeq2,
sampleDir = sampleDir,
sample_table = sample_table,
tx2gene = Marchantia7_tx2gene,
geneID2GO = geneID2GO,
orgdb = "org.At.tair.db",
semdata = save,
Name = "Tak1_OPDA",
Variable = c("Genotype", "Treatment"),
Design = "Genotype + Treatment + Genotype:Treatment"
)## [1] These are the DESeq2 contrasts, that you have to use like list(c("ContrastA", "ContrastB")) :
## [1] "Intercept" "Genotype_gh3a_vs_Tak1"
## [3] "Treatment_OPDA_vs_Mock" "Genotypegh3a.TreatmentOPDA"
## [1] These are the costumized contrasts, that you have to use like ContrastA - ContrastB :
## [1] "Tak1" "gh3a" "Mock" "OPDA" "Tak1_Mock" "Tak1_OPDA"
## [7] "gh3a_Mock" "gh3a_OPDA"
## Error in DESeq2_simple(output_path = output_Dir_DESeq2, sampleDir = sampleDir, : Choose one of the previous contrasts
After running DESeq2_simple with no Contrast parameter,
there is an error telling us we have to choose one of the previous
contrasts printed in the console. In this example, we want to compare
Tak-1 genotype in OPDA treatment samples vs. Tak-1 genotype in Mock
conditions. There are two different ways we can set this contrast:
Tak1_OPDA_result <- DESeq2_simple(
output_path = output_Dir_DESeq2,
sampleDir = sampleDir,
sample_table = sample_table,
Include = NULL,
Exclude = NULL,
tx2gene = Marchantia7_tx2gene,
Variable = c("Genotype", "Treatment"),
Design = "Genotype + Treatment + Genotype:Treatment",
Group = "NO",
Name = "Tak1_OPDA",
Contrast = Tak1_OPDA - Tak1_Mock,
Reduced = FALSE,
log2FCtopGO = 1,
geneID2GO = geneID2GO,
ontology = "BP",
plot_similarity = TRUE,
orgdb = "org.At.tair.db",
semdata = save
)If we want to perform other possible analyses, these would be the contrast we would have to use:
OPDA vs Mock in gh3a genotype: gh3a_OPDA - gh3a_Mock, or list(c(“Treatment_OPDA_vs_Mock”, “Genotypegh3a.TreatmentOPDA”)).
gh3a vs Tak-1 in Mock conditions: gh3a_Mock - Tak1_Mock, or list(c(“Genotype_gh3a_vs_Tak1”)).
gh3a vs Tak-1 in OPDA conditions: gh3a_OPDA - Tak1_OPDA, or list(c(“Genotype_gh3a_vs_Tak1”, “Genotypegh3a.TreatmentOPDA”)).
Genes affected by the Genotype:Treatment interaction: (gh3a_OPDA - gh3a_Mock) - (Tak1_OPDA - Tak1_Mock), or list(c(“Genotypegh3a.TreatmentOPDA”)). This are genes that will be responding to OPDA treatment in the gh3a genotype in a different way that the expected from the response in Tak-1 genotype.
dir.create(file.path(output_dir, "DESeq2", "Genotype_Treatment_interaction"))
output_Dir_interaction <- file.path(output_dir, "DESeq2", "Genotype_Treatment_interaction")
Genotype_Treatment_result <- DESeq2_simple(
output_path = output_Dir_interaction,
sampleDir = sampleDir,
sample_table = sample_table,
Include = NULL,
Exclude = NULL,
tx2gene = Marchantia7_tx2gene,
Variable = c("Genotype", "Treatment"),
Design = "Genotype + Treatment + Genotype:Treatment",
Group = "NO",
Name = "Genotype_Treatment_interaction",
Contrast = list(c("Genotypegh3a.TreatmentOPDA")),
Reduced = FALSE,
log2FCtopGO = 1,
geneID2GO = geneID2GO,
ontology = "BP",
plot_similarity = TRUE,
orgdb = "org.At.tair.db",
semdata = save
)By default, all samples in sampleDir will be included
for the differential expression analysis using
DESeq2_simple function. However, it could be interesting to
perform an analysis including only a subset of these samples. For
example, let’s think about the scenario where we want to study DEGs by
OPDA in Tak-1 genotype ignoring the existence of gh3a in the
experiment. Thus, we would only want to include samples belonging to
Tak-1 genotype. There are two possible ways of doing that in
DESeq2_simple function: by using Include or
Exclude parameters.
Include: we define features in a vector that all
samples must present to be included in the analysis. In this example, it
would be Include = c("Tak1").Exclude: we define features in a vector that we want to
remove from the analysis. All samples containing any of these features
will be removed. In this example, it would be
Exclude = c("gh3a").In our example, both parameters would work the same way, i.e., only including Tak-1 samples in the analysis, both in OPDA and Mock treatments.
Apart from differential expression analysis, all samples included in
the DESeq2_simple function will be compute for principal
component analysis and we will obtain a png file of PC1 vs PC2 using 500
top genes by default. If we do not want to generate the PCA, we can set
the PCA parameter to FALSE withing the
DESeq2_simple function. If we want to perform a different
principal component analysis (plotting other component, using a
different number of top genes, changing colors, etc…) we can do it by
running the run_pca_analysis function.
After running DESeq2_simple function, there are some
files generated with results from the differential gene expression
analysis:
## baseMean log2FoldChange lfcSE stat pvalue
## Mp1g00070 147.7289 -0.87073377 0.2215426 -3.9303223 8.483207e-05
## Mp1g00080 8077.3196 -0.54069976 0.1329801 -4.0660199 4.782285e-05
## Mp1g00090 531.2465 -0.05551592 0.1356234 -0.4093389 6.822909e-01
## Mp1g00100 1127.5365 0.87159546 0.1302033 6.6941140 2.169823e-11
## Mp1g00110 419.9734 -0.10280589 0.1763512 -0.5829610 5.599196e-01
## Mp1g00120 511.5859 -0.35720864 0.1180411 -3.0261372 2.476999e-03
## padj
## Mp1g00070 3.932574e-04
## Mp1g00080 2.335654e-04
## Mp1g00090 7.772935e-01
## Mp1g00100 2.389950e-10
## Mp1g00110 6.791640e-01
## Mp1g00120 7.944981e-03
## Converting page 1 to Tak1_OPDA_heatmap_1.png... done!
Heatmap of normalized values of DEGs in the OPDA vs Mock comparison for Tak-1 genotype
PCA parameter is set as the
TRUE default value (PCA_Name.png).PCA plot of PC1 vs PC2, using 500 top genes
## Annotated Significant Expected
## GO:0044238 6226 509 528.04
## GO:0045893 688 59 58.35
## GO:0048574 90 9 7.63
## GO:0010154 1454 150 123.32
## GO:0044706 462 48 39.18
## GO:0044281 1878 198 159.28
## Term pval
## GO:0044238 primary metabolic process 0.0061337475
## GO:0045893 positive regulation of DNA-templated transcription 0.0060918982
## GO:0048574 long-day photoperiodism, flowering 0.0049341244
## GO:0010154 fruit development 0.0002647838
## GO:0044706 multi-multicellular organism process 0.0321583675
## GO:0044281 small molecule metabolic process 0.0091339926
## Enrichment
## GO:0044238 1.006982
## GO:0045893 1.056274
## GO:0048574 1.231723
## GO:0010154 1.270691
## GO:0044706 1.279712
## GO:0044281 1.298621
A tree map plot of semantic-clustered GO terms (topGO_Name_All/Up/Down_Treemap.pdf).
A scatter plot of semantic-clustered GO terms (topGO_Name_All/Up/Down_Scatterplot.pdf).
A bubble plot of all enriched GO terms in descendant order according to their enrichment (topGO_Name_All/Up/Down_bubble.pdf).
Tree map of Gene Ontologies enriched in Up-regulated genes
Semantic cluster of Gene Ontologies enriched in Up-regulated genes
Bubble plot of Gene Ontologies enriched in Up-regulated genes
Another application of the package is the generation of co-expressed
gene modules by the implementation of components from the WGCNA3 package. This functionality is
included in the WGCNA_Modules function. As
DESeq2_simple function, it requires to specify the
output_path, sampleDir,
sample_table, Variables, tx2gene,
Name, geneID2GO and Annotation
parameters. The specific parameters that have also to be included in
this function are:
DEGs: A vector of Gene IDs. This will be the input set
of genes that we want to cluster. The Gene IDs must correspond to the
transcriptome that we are using in the analysis.Power: Soft thresholding power used in the network
construction (check the WGNA
manual or tutorial
for more info. If we run the WGCNA_Modules function without
the Power parameter specified, it will stops and generate a
“softthresholdingpower.pdf” file, that can help us to select the power.
As an alternative, this table suggest the power to select depending on
the number of samples and type of network: Most of the advanced parameters in for the
blockwiseModules function of the WGCNA package for the
module generation are set to default.
As an example, we will use the output DEGs from the
Genotype:Treatment interaction in DESeq2_simple, i.e.,
genes that significantly respond to the OPDA treatment in the
gh3a genotype in a different way than the Tak-1 genotype. This
different response can be in different ways, so we will use the gene
co-expression modules to explore this responses.
DESeq2_simple_output <- sigtable <- read.table(file.path(output_Dir_interaction, "Genotype_Treatment_interaction.txt"), header = TRUE)
DEGs <- rownames(DESeq2_simple_output)Now that we already have selected our input gene set, we can run the
WGCNA_Modules function:
library(WGCNA)
dir.create(file.path(output_dir, "WGCNA"))
output_WGCNA <- file.path(output_dir, "WGCNA")
COIOPDA_int_Tak1OPDA_WGCNA <- WGCNA_Modules(output_path =output_WGCNA,
sampleDir = sampleDir,
sample_table = sample_table,
DEGs = DEGs,
Variables = c("Genotype", "Treatment"),
tx2gene = Marchantia7_tx2gene,
Name = "Genotype_Treatment_Interaction",
NumberCol = 1,
geneID2GO = geneID2GO,
semdata = save)## Flagging genes and samples with too many missing values...
## ..step 1
## pickSoftThreshold: will use block size 3269.
## pickSoftThreshold: calculating connectivity for given powers...
## ..working on genes 1 through 3269 of 13684
## Warning: executing %dopar% sequentially: no parallel backend registered
## ..working on genes 3270 through 6538 of 13684
## ..working on genes 6539 through 9807 of 13684
## ..working on genes 9808 through 13076 of 13684
## ..working on genes 13077 through 13684 of 13684
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.6290 1.600 0.752 5500.0 5620.00 7910
## 2 2 0.0573 0.112 0.132 3090.0 2970.00 5730
## 3 3 0.7080 -0.360 0.665 2010.0 1760.00 4530
## 4 4 0.9310 -0.576 0.913 1420.0 1110.00 3760
## 5 5 0.9680 -0.702 0.959 1060.0 738.00 3210
## 6 6 0.9720 -0.783 0.965 823.0 505.00 2800
## 7 7 0.9740 -0.840 0.967 658.0 357.00 2480
## 8 8 0.9740 -0.877 0.968 538.0 260.00 2220
## 9 9 0.9760 -0.908 0.971 448.0 193.00 2010
## 10 10 0.9740 -0.937 0.969 378.0 146.00 1830
## 11 12 0.9750 -0.976 0.975 280.0 87.00 1550
## 12 14 0.9780 -1.000 0.980 215.0 54.20 1340
## 13 16 0.9770 -1.030 0.980 170.0 36.00 1170
## 14 18 0.9720 -1.050 0.977 137.0 24.80 1040
## 15 20 0.9720 -1.070 0.979 113.0 17.50 931
## 16 22 0.9710 -1.080 0.979 94.2 12.70 841
## 17 24 0.9730 -1.090 0.981 79.7 9.42 764
## 18 26 0.9740 -1.100 0.982 68.2 7.13 699
## 19 28 0.9730 -1.110 0.982 59.0 5.47 642
## 20 30 0.9710 -1.120 0.980 51.4 4.22 593
## 21 32 0.9720 -1.120 0.981 45.2 3.32 549
## 22 34 0.9740 -1.130 0.983 39.9 2.61 511
## 23 36 0.9740 -1.130 0.983 35.5 2.09 477
## 24 38 0.9710 -1.140 0.980 31.8 1.70 446
## 25 40 0.9720 -1.140 0.981 28.6 1.39 419
## Error in WGCNA_Modules(output_path = output_WGCNA, sampleDir = sampleDir, : Choose a power based on softthresholdingpower.pdf
We did not include the Power parameter, so the function
stopped at warned us to select it based on the
“softthresholdingpower.pdf” file.
R^2 values as a function of the soft thresholds
Based on this plot, we will select 22 as the soft thresholding power since it is the first value that rises the threshold.
COIOPDA_int_Tak1OPDA_WGCNA <- WGCNA_Modules(output_path =output_WGCNA,
sampleDir = sampleDir,
sample_table = sample_table,
DEGs = DEGs,
Variables = c("Genotype", "Treatment"),
tx2gene = Marchantia7_tx2gene,
Name = "Genotype_Treatment_Interaction",
Power = 22,
NumberCol = 1,
geneID2GO = geneID2GO,
semdata = save)## Flagging genes and samples with too many missing values...
## ..step 1
## pickSoftThreshold: will use block size 3269.
## pickSoftThreshold: calculating connectivity for given powers...
## ..working on genes 1 through 3269 of 13684
## ..working on genes 3270 through 6538 of 13684
## ..working on genes 6539 through 9807 of 13684
## ..working on genes 9808 through 13076 of 13684
## ..working on genes 13077 through 13684 of 13684
## Power SFT.R.sq slope truncated.R.sq mean.k. median.k. max.k.
## 1 1 0.6290 1.600 0.752 5500.0 5620.00 7910
## 2 2 0.0573 0.112 0.132 3090.0 2970.00 5730
## 3 3 0.7080 -0.360 0.665 2010.0 1760.00 4530
## 4 4 0.9310 -0.576 0.913 1420.0 1110.00 3760
## 5 5 0.9680 -0.702 0.959 1060.0 738.00 3210
## 6 6 0.9720 -0.783 0.965 823.0 505.00 2800
## 7 7 0.9740 -0.840 0.967 658.0 357.00 2480
## 8 8 0.9740 -0.877 0.968 538.0 260.00 2220
## 9 9 0.9760 -0.908 0.971 448.0 193.00 2010
## 10 10 0.9740 -0.937 0.969 378.0 146.00 1830
## 11 12 0.9750 -0.976 0.975 280.0 87.00 1550
## 12 14 0.9780 -1.000 0.980 215.0 54.20 1340
## 13 16 0.9770 -1.030 0.980 170.0 36.00 1170
## 14 18 0.9720 -1.050 0.977 137.0 24.80 1040
## 15 20 0.9720 -1.070 0.979 113.0 17.50 931
## 16 22 0.9710 -1.080 0.979 94.2 12.70 841
## 17 24 0.9730 -1.090 0.981 79.7 9.42 764
## 18 26 0.9740 -1.100 0.982 68.2 7.13 699
## 19 28 0.9730 -1.110 0.982 59.0 5.47 642
## 20 30 0.9710 -1.120 0.980 51.4 4.22 593
## 21 32 0.9720 -1.120 0.981 45.2 3.32 549
## 22 34 0.9740 -1.130 0.983 39.9 2.61 511
## 23 36 0.9740 -1.130 0.983 35.5 2.09 477
## 24 38 0.9710 -1.140 0.980 31.8 1.70 446
## 25 40 0.9720 -1.140 0.981 28.6 1.39 419
## Calculating module eigengenes block-wise from all genes
## Flagging genes and samples with too many missing values...
## ..step 1
## ....pre-clustering genes to determine blocks..
## Projective K-means:
## ..k-means clustering..
## ..merging smaller clusters...
## Block sizes:
## gBlocks
## 1 2 3
## 4997 4543 4144
## ..Working on block 1 .
## TOM calculation: adjacency..
## ..will not use multithreading.
## Fraction of slow calculations: 0.000000
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
## ..saving TOM for block 1 into file Salva-block.1.RData
## ....clustering..
## ....detecting modules..
## ....calculating module eigengenes..
## ....checking kME in modules..
## ..Working on block 2 .
## TOM calculation: adjacency..
## ..will not use multithreading.
## Fraction of slow calculations: 0.000000
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
## ..saving TOM for block 2 into file Salva-block.2.RData
## ....clustering..
## ....detecting modules..
## ....calculating module eigengenes..
## ....checking kME in modules..
## ..Working on block 3 .
## TOM calculation: adjacency..
## ..will not use multithreading.
## Fraction of slow calculations: 0.000000
## ..connectivity..
## ..matrix multiplication (system BLAS)..
## ..normalization..
## ..done.
## ..saving TOM for block 3 into file Salva-block.3.RData
## ....clustering..
## ....detecting modules..
## ....calculating module eigengenes..
## ....checking kME in modules..
## ..merging modules that are too close..
## mergeCloseModules: Merging modules whose distance is less than 0.25
## Calculating new MEs...
## Warning in RColorBrewer::brewer.pal(length(unique_groups), "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
There are 10 different clusters that have been generated after the analysis. The average behavior of genes from each category are summarized in the “Summary_Modules_Name.pdf” file:
Summary of gene expression behavior in each co-expression module of genes responding to the Genotype:Treatment interaction